A causal view of compositional zero-shot recognition
People easily recognize new visual categories that are new combinations of known components. This compositional generalization capacity is critical for learning in real-world domains like vision and language, because the long tail of new combinations dominates the distribution. Unfortunately, learning systems struggle with compositional generalization because they often build on features that are correlated with class labels even when they are not essential for the class. This leads to consistent misclassification of samples from a new distribution, such as new combinations of known components. Here we describe an approach to compositional generalization that builds on causal ideas. First, we describe compositional zero-shot learning from a causal perspective, and propose to view zero-shot inference as answering the question "which intervention caused the image?". Second, we present a causal-inspired embedding model that learns disentangled representations of the elementary components of visual objects from correlated (confounded) training data. We evaluate this approach on two datasets for predicting new attribute-object combinations: a well-controlled dataset of synthesized images and a real-world dataset of fine-grained shoe types. We show improvements over strong baselines.
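The "which intervention caused the image?" view of zero-shot inference can be illustrated with a minimal sketch: each candidate attribute-object pair is treated as a possible intervention, and the pair whose composed embedding best explains the image embedding is selected. All names, the additive composition, and the random embeddings below are illustrative assumptions, not the paper's actual model.

```python
# Hypothetical sketch of intervention-based zero-shot inference.
# Each candidate (attribute, object) pair is scored as a possible
# "intervention", and the best-explaining pair is returned.
import numpy as np

rng = np.random.default_rng(0)
EMB = 8  # embedding dimension (illustrative)

attributes = ["red", "blue"]
objects = ["cube", "sphere"]

# Pretend these lookup tables were learned from (confounded) training data.
attr_emb = {a: rng.normal(size=EMB) for a in attributes}
obj_emb = {o: rng.normal(size=EMB) for o in objects}

def compose(attr, obj):
    # A simple additive composition of the component embeddings.
    return attr_emb[attr] + obj_emb[obj]

def infer_pair(image_emb):
    # "Which intervention caused the image?": score every candidate
    # (attribute, object) intervention and return the closest one.
    candidates = [(a, o) for a in attributes for o in objects]
    return min(candidates,
               key=lambda pair: np.linalg.norm(compose(*pair) - image_emb))

# A held-out combination ("red sphere") never seen at training time:
image = compose("red", "sphere") + 0.01 * rng.normal(size=EMB)
print(infer_pair(image))  # -> ('red', 'sphere')
```

Note that this toy version sidesteps the paper's central difficulty, disentangling the component embeddings when training pairs are correlated; it only shows how inference over unseen combinations works once such embeddings exist.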
Review for NeurIPS paper: A causal view of compositional zero-shot recognition
Weaknesses: * This method is most suitable for variables that have a single parent in the causal DAG -- the class label. This severely restricts the class of attributes that can be modeled, and manifests in the paper as experiments with simple attributes (colors in AO-CLEVr, and materials in Zappos). In fact, prior work has noted that attributes (or other compositional modifiers) manifest very differently for different objects ([36] cites examples from prior work: "fluffy" for towels vs. dogs, "ripe" for one fruit vs. another, etc.). For these attributes, and many others, the data-generating process is not so straightforward -- there are edges from both attribute labels and object labels to the core features. The authors do acknowledge this limitation in L326; however, it is an important weakness to consider, given that _difficult_ instances in real-world datasets (where both object and attribute are parents of \phi_a, for example) are fairly prevalent.
Review for NeurIPS paper: A causal view of compositional zero-shot recognition
All four reviewers appreciated the neat idea contained in this paper, which is also shown to work well in practice. The authors open up the way for studying data-generating processes through causal interventions, which is a novel and technically interesting direction. Most importantly, it is a significant direction that is expected to stimulate further research in the field. I am recommending acceptance of this paper; however, please consider revising the manuscript to address R4's remarks about clarity and R2's and R3's remarks about a deeper discussion of failure cases and limitations.